The AI Hierarchy — What Fits Where
Every AI buzzword maps to a specific layer in a hierarchy. Understanding that hierarchy is the single most important preparation before you walk into any AI conference.
Key Distinction: AI vs ML vs Deep Learning
| Concept | What It Is | Example |
|---|---|---|
| AI | Any system that performs tasks typically requiring human intelligence. Includes rule-based systems. | A chess engine with hardcoded rules. An if-else fraud filter. |
| Machine Learning | A subset of AI. The system learns patterns from data instead of being explicitly programmed. | Spam filter that learns from labeled emails. Recommendation engines. |
| Deep Learning | A subset of ML using neural networks with many layers. Excels at unstructured data. | ChatGPT, image recognition, voice assistants. |
| Generative AI | A subset of DL that creates new content — text, images, code, audio. | Claude writing an email. DALL-E generating an image. |
CTO mental model: All generative AI is deep learning. All deep learning is ML. All ML is AI. But not all AI is ML — rule-based expert systems are AI but not ML.
Types of Machine Learning
Supervised Learning
Learns from labeled examples. Input → known output. Used for classification, regression, forecasting.
Unsupervised Learning
Finds patterns in unlabeled data. No "right answer" given. Used for segmentation, anomaly detection.
Reinforcement Learning
Agent learns by trial-and-error, maximizing a reward signal. Used for game AI, robotics, RLHF for LLMs.
Self-Supervised Learning
The model creates its own labels from the data. "Predict the next word." This is how LLMs are pre-trained.
How AI Actually Learns — From Data to Intelligence
Neural Networks: The Core Mechanism
A neural network is a function that takes input numbers and produces output numbers, with adjustable parameters (called weights) in between. "Learning" means adjusting those weights to minimize errors.
The Training Loop (every AI model follows this)
1. Forward pass: Feed input data through the network. It produces a prediction.
2. Loss calculation: Compare the prediction to the correct answer. Measure how wrong it was (the "loss").
3. Backpropagation: Calculate how each weight contributed to the error.
4. Weight update: Adjust weights slightly to reduce the error (using gradient descent).
5. Repeat: Do this billions of times across the entire dataset. Each full pass = one "epoch."
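To make the loop concrete, here is a minimal sketch, assuming a toy one-weight model fit to y = 2x; real networks have billions of weights and use automatic differentiation, but the loop has exactly this shape.

```python
# Minimal sketch of the training loop for a one-weight model y = w * x,
# fit to the target function y = 2x. Real networks have billions of
# weights and use autodiff, but the loop is the same.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, correct answer)
w = 0.0                                       # the single adjustable weight
lr = 0.01                                     # learning rate

for epoch in range(100):                      # 5. repeat over the dataset
    for x, target in data:
        pred = w * x                          # 1. forward pass
        loss = (pred - target) ** 2           # 2. loss: squared error
        grad = 2 * (pred - target) * x        # 3. backprop: dLoss/dw
        w -= lr * grad                        # 4. weight update (gradient descent)

print(round(w, 3))                            # converges to ~2.0
```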
What Makes Deep Learning "Deep"
A shallow network has 1-2 hidden layers. A deep network has dozens to hundreds. Each layer learns increasingly abstract features: early layers pick up edges and textures, later layers object parts and whole objects.
The Transformer Architecture (2017 — the breakthrough)
Before transformers, AI processed text word-by-word sequentially (slow, forgetful). The transformer introduced self-attention: the model can look at all words in a sentence simultaneously and learn which words relate to which.
Why it matters: Every major LLM today — GPT-4, Claude, Gemini, Llama — is a transformer. The 2017 Google paper "Attention Is All You Need" is the single most important AI paper of the decade.
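For intuition, here is a minimal NumPy sketch of scaled dot-product self-attention, the core operation inside a transformer. The tiny shapes and random projection matrices are illustrative; real models learn the Q/K/V projections and run many attention heads in parallel.

```python
# Minimal sketch of scaled dot-product self-attention in NumPy.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every token scores every other token
    weights = softmax(scores)                # rows sum to 1: "how much to attend"
    return weights @ V                       # weighted mix of all token values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # -> (4, 8)
```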
Key Numbers (to have in your back pocket)
| Metric | What It Means | Typical Values |
|---|---|---|
| Parameters | The adjustable weights in the model. More ≈ more capacity to learn. | GPT-4: ~1.8T, Llama 3: 8B-405B, Claude: undisclosed |
| Context Window | How much text the model can "see" at once (input + output). | Claude: 200K tokens. GPT-4: 128K. Gemini: 1M+ |
| Tokens | Chunks of text (~0.75 words per token). The unit of measurement for LLMs. | This entire page ≈ 4,000 tokens |
| Training Data | Total text the model was trained on. | Typically trillions of tokens from books, web, code |
| Inference | Running the trained model to generate a response. What you pay for via API. | ~$3-15 per million input tokens (varies by model) |
Large Language Models — Plus SLMs & Frontier Models
What an LLM Actually Does
An LLM is a next-token predictor. Given a sequence of tokens, it predicts the probability distribution over all possible next tokens, then samples from that distribution. That's it. All the apparent "intelligence" emerges from doing this prediction extremely well over extremely large amounts of data.
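A minimal sketch of that predict-then-sample step, with temperature. The three-token vocabulary and logit values are invented for illustration; a real LLM scores a vocabulary of roughly 100K tokens at every step.

```python
# Minimal sketch of "predict a distribution, then sample" with temperature.
import numpy as np

vocab = ["Paris", "London", "pizza"]
logits = np.array([4.0, 2.5, 0.1])           # raw scores for each candidate token

def sample(logits, temperature=1.0):
    scaled = logits / temperature            # low temp sharpens, high temp flattens
    probs = np.exp(scaled - scaled.max())
    probs = probs / probs.sum()              # softmax -> probability distribution
    return np.random.choice(vocab, p=probs)

print(sample(logits, temperature=0.2))       # almost always "Paris"
print(sample(logits, temperature=1.5))       # noticeably more variety
```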
The Three Phases of Building an LLM
Phase 1 — Pre-training (costs $10M-$100M+): Feed the model trillions of tokens of text from the internet, books, code. Produces a "base model" — it can complete text but won't follow instructions.
Phase 2 — Fine-tuning (SFT): Train on curated instruction/response pairs. Teaches the model to be a helpful assistant.
Phase 3 — RLHF: Human raters rank multiple model responses. A reward model learns what humans prefer. This is what makes Claude polite, safe, and genuinely useful.
SLM vs LLM vs Frontier Model — Match the Model to the Task
The terms SLM (Small Language Model), LLM (Large Language Model), and FM (Frontier Model) aren't three separate categories — LLM is the umbrella term. But they're labeled differently because we use them differently.
SLM — Small Language Model
Parameters: <10B. Role: Efficient specialist. Fast, cheap, runs on-prem. Best for: document classification, code routing, summarization. Well-tuned SLMs can match bigger models at focused tasks. Examples: IBM Granite 4.0, Mistral small models.
LLM — Large Language Model
Parameters: 10B-100B+. Role: Generalist. Broad knowledge across many domains. Best for: complex customer support, nuanced reasoning, multi-domain synthesis. Runs in cloud/SaaS.
FM — Frontier Model
Parameters: 100B+. Role: Cutting-edge. Best reasoning, best at complex multi-step tasks, deep tool integration. Best for: autonomous incident response, agentic systems, complex planning. Examples: Claude Opus, GPT-5, Gemini Pro.
Decision heuristic: Use an SLM when you need speed, low cost, or on-prem control. Use an LLM when you need broad knowledge and nuanced reasoning. Use a Frontier Model when you need the absolute best complex reasoning for multi-step problems. Match the model to the task — don't use a sledgehammer for a thumbtack.
Key LLM Capabilities
In-Context Learning
Give the LLM examples in the prompt, it adapts behavior without retraining. "Few-shot" prompting.
Chain of Thought
Ask it to "think step by step" and accuracy on reasoning tasks jumps dramatically.
Tool Use / Function Calling
The LLM outputs structured JSON to call external APIs, databases, or tools. Foundation of agents; see the sketch after this list.
RAG
Before answering, retrieve relevant documents from a database and inject them into the context. Reduces hallucination.
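To make tool use concrete, here is a hedged sketch of the function-calling round trip; the tool name, schema shape, and field names are illustrative rather than any specific provider's wire format.

```python
# Hedged sketch of the function-calling round trip.
tool_schema = {
    "name": "get_order_status",              # hypothetical tool
    "description": "Look up an order's shipping status by ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}

# Instead of prose, the model emits structured JSON naming a tool:
model_output = {"tool": "get_order_status", "args": {"order_id": "A-1042"}}

# Your code executes the real function and feeds the result back; the
# model then writes the final natural-language answer from it.
tool_result = {"order_id": "A-1042", "status": "shipped", "eta": "2 days"}
```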
What LLMs Cannot Do
• No true memory: Each conversation starts fresh unless you engineer persistence.
• Hallucinations: They confidently state false things. Inherent to probabilistic generation.
• No real-time data: Knowledge is frozen at training cutoff unless you add retrieval tools.
• Math and precise logic: Unreliable for complex calculations without tool use. They approximate; they don't compute.
• Determinism: Same input can produce different outputs. Temperature controls randomness but never eliminates it.
The Open Model Ecosystem — Hugging Face & When to Use It
What is Hugging Face?
Hugging Face is the largest public repository of pre-trained AI models. Think of it as the “GitHub for machine learning” — a shared platform where research labs, companies, and independent developers upload trained models, datasets, and application demos. The models range from tiny text classifiers to massive LLMs.
What lives on Hugging Face?
- Language models — Llama, Mistral, Gemma, BERT, and thousands of fine-tuned variants.
- Imaging models — Stable Diffusion, BLIP, Vision Transformers.
- Speech models — Whisper (ASR), Bark & XTTS (text-to-speech).
- Embedding models — the backbone of RAG (e.g., all-MiniLM-L6-v2).
- Multimodal models — models that jointly process text, images, and sometimes audio.
Base Models vs Fine‑Tuned Models — Where the Cost Really Goes
Training a model from random weights (a base model) demands enormous compute. For example, Google’s BERT needed 4 days on 64 TPUs at an estimated hardware cost of $50k–$100k, while today’s largest LLMs run into tens of millions of dollars just for the electricity and GPUs. These base models are built by well‑funded labs — Google, Meta, OpenAI, Mistral, Stability AI — who then release the finished weights publicly on Hugging Face.
When you “get a model from Hugging Face,” you almost never train from scratch. Instead, you download the open‑sourced weights and fine‑tune them on your own much smaller dataset. Fine‑tuning adjusts only the final layers (or a fraction of the total parameters) and can be done on a single GPU in hours for a few dollars. That is why a small team can build a custom medical‑document classifier or a support‑ticket router for a fraction of what the original training cost.
Base Model (Training from Scratch)
Who pays: Big Tech or well‑funded research groups
Compute: Hundreds of GPUs/TPUs for weeks or months
Cost: Millions of dollars
Output: A general‑purpose brain that understands language or images
Fine‑Tuned Model (Your Work)
Who pays: Your team
Compute: A single GPU for hours
Cost: Tens to hundreds of dollars
Output: A specialist that excels at one narrow task using the base model’s knowledge
Key insight: The expensive bit — learning grammar, common sense, visual features — has already been done. Hugging Face gives you a starting model that already knows what a sentence or an edge looks like. You spend a tiny amount to teach it your domain‑specific patterns.
When Should You Pull a Model from Hugging Face?
Enterprises and developers usually turn to Hugging Face in these concrete scenarios:
| Situation | Why Hugging Face (instead of a closed API) |
|---|---|
| Data must stay on‑prem / in your VPC | Download an open‑source LLM (Llama, Mistral, Gemma) and run it on your own servers. No data ever leaves your infrastructure. |
| Task is narrow and high‑volume | A fine‑tuned BERT‑family model for classification can be 100× cheaper per query than calling a GPT‑4 API and often just as accurate. |
| Cost at scale | For millions of inference requests per day, self‑hosting a small model on your own GPU instance usually beats pay‑per‑token pricing. |
| Avoiding vendor lock‑in | Open models are portable — you can move them between clouds or run them on‑prem, and swap providers freely. |
| R&D / prototyping | Experiment with different architectures without API bills. Test accuracy, speed, and failure modes before committing to a production stack. |
| Transparency & auditability | You can inspect the model card, training data, and even run bias and safety checks on open models — impossible with a closed API. |
| Embeddings / RAG pipeline | Hugging Face hosts state‑of‑the‑art embedding models that convert text into vectors for semantic search, often the best‑performing options available. |
Trade‑off: Self‑hosting an open model requires you to manage infrastructure, security, and model updates. If you lack in‑house ML engineering, a hosted API may still be the faster, safer choice.
How to Build a Model for Hugging Face (The Quick Version)
The typical workflow to create and share a model is straightforward:
1. Define the task — text classification, image recognition, question-answering, etc.
2. Collect a dataset — labeled examples specific to your problem.
3. Choose a pre-trained base model from Hugging Face that is already close to what you need (e.g., BERT for text, ViT for images).
4. Fine-tune using the Hugging Face transformers library — a Trainer handles the training loop.
5. Evaluate on a held-out test set to confirm performance.
6. Package the trained weights, config, and tokenizer into a single folder.
7. Write a model card (README) describing what the model does, its training data, limitations, and intended use.
8. Upload via the Hugging Face Hub — either drag-and-drop on the website or a single command with huggingface_hub.
The entire process can be done in a few hours, and many teams start from a community‑shared Colab notebook that already does steps 2–6 with a single click.
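As a concrete illustration of steps 3, 4, and 6, here is a minimal fine-tuning sketch using the transformers Trainer; the CSV file name, label count, and hyperparameters are placeholder assumptions, not a production recipe.

```python
# Minimal fine-tuning sketch with the transformers Trainer. Assumes
# "tickets.csv" has "text" and "label" columns (both hypothetical).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

base = "bert-base-uncased"                       # step 3: pre-trained base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base, num_labels=2)                          # e.g., billing vs. technical

data = load_dataset("csv", data_files="tickets.csv")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

data = data.map(tokenize, batched=True)

trainer = Trainer(                               # step 4: Trainer runs the loop
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=data["train"],
)
trainer.train()
trainer.save_model("my-ticket-router")           # step 6: weights + config...
tokenizer.save_pretrained("my-ticket-router")    # ...plus tokenizer, one folder
```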
The AI Provider Landscape
Foundation Model Providers
| Provider | Models | Strengths | Access |
|---|---|---|---|
| Anthropic | Claude (Opus, Sonnet, Haiku) | Safety, long context (200K), instruction following, coding, analysis | API, claude.ai, AWS Bedrock, GCP Vertex |
| OpenAI | GPT-4o, o1, o3 | Broad capabilities, vision, ecosystem, first-mover brand | API, ChatGPT, Azure OpenAI |
| Google | Gemini (Ultra, Pro, Flash) | Multimodal, huge context (1M+), integrated with Google Cloud | API, Gemini app, GCP Vertex |
| Meta | Llama 3/4 | Open-source, self-hostable, fine-tunable, no vendor lock-in | Download weights, run anywhere |
| Mistral | Mixtral, Mistral Large | European, efficient, open-weight options | API, self-host |
Open Source vs Closed Source — CTO Decision Framework
| Factor | Closed (Claude, GPT-4) | Open (Llama, Mistral) |
|---|---|---|
| Performance | Generally best-in-class | Closing the gap rapidly |
| Cost | Pay per token (API) | Infra cost (GPUs) — can be cheaper at scale |
| Data privacy | Data sent to provider's API (enterprise tiers offer zero retention) | Runs on your infra — full control |
| Customization | Prompt engineering, some fine-tuning | Full fine-tuning, modify architecture |
| Maintenance | Provider handles everything | You own ops, updates, security |
| Best for | Fast deployment, best quality, small-medium scale | High volume, strict compliance, niche domains |
AI Agents & Agent Skills
What is an AI Agent?
An AI agent is an LLM that can plan, use tools, observe results, and iterate — autonomously. Instead of just answering a question, it takes action to accomplish a goal. This is the shift from monolithic models to compound AI systems — where the model is integrated into existing processes with programmatic components around it.
Agent vs Automation vs Chatbot — The Critical Distinction
| Feature | Traditional Automation (RPA) | Chatbot (rule-based) | AI Chatbot (LLM) | AI Agent |
|---|---|---|---|---|
| Decision making | None. Follows fixed rules. | Decision tree only | Flexible, but single-turn | Plans multi-step, adapts |
| Handles ambiguity | No — breaks on edge cases | No | Yes | Yes |
| Uses tools | Hardcoded integrations | No | If programmed to | Autonomously decides which tools |
| Memory | None | Session only | Session only | Short + long-term memory |
| Autonomy | Zero | Zero | Low | High — can loop, retry, escalate |
The Agent Loop (ReAct Pattern)
1. Observe: Receive user request or trigger event. Retrieve relevant context from memory.
2. Think: The LLM reasons about what to do next. Creates a plan or picks the next action.
3. Act: Call a tool — query a database, call an API, send a message, update a record.
4. Observe: Check the result. Did it work? Was the data correct?
5. Loop or stop: If the goal is met, respond. If not, go back to step 2 with updated context.
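A minimal sketch of this loop in Python; llm() and the single fake tool are hypothetical stand-ins rather than any specific SDK, and the hard step cap is the kind of guardrail covered below.

```python
# Minimal agent-loop sketch (ReAct pattern) with hypothetical stand-ins.
import json

def llm(messages: list) -> dict:
    """Stand-in for a chat-completion call. Expected to return either
    {"tool": name, "args": {...}} or {"answer": text}."""
    raise NotImplementedError

TOOLS = {
    "lookup_order": lambda order_id: {"status": "shipped"},  # fake demo tool
}

def run_agent(user_request: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_request}]    # 1. Observe
    for _ in range(max_steps):         # hard cap: guardrail against runaway loops
        decision = llm(messages)       # 2. Think: model picks the next action
        if "answer" in decision:       # 5. Stop: goal met, respond
            return decision["answer"]
        result = TOOLS[decision["tool"]](**decision["args"])  # 3. Act
        messages.append({"role": "tool",                      # 4. Observe
                         "content": json.dumps(result)})
    return "Step limit reached; escalating to a human."
```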
AI Agent Skills — Procedural Knowledge for Agents
LLMs know facts (semantic memory) but lack procedural knowledge — the step-by-step workflows specific to how work actually gets done. Agent Skills solve this by packaging procedural knowledge into a simple, portable format.
What a Skill Looks Like
A skill is simply a skill.md file in a folder. At minimum it has:
- Name — identifies the skill
- Description — tells the agent when to use this skill (the trigger condition)
- Body — step-by-step instructions, rules, examples in plain markdown
Optional folders: scripts/ (executable Python/JS/Bash), references/ (additional docs), assets/ (templates, data files).
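For illustration, a minimal sketch of what such a skill file might contain; the skill name, trigger, and steps are invented, with the name and description carried as metadata at the top of the file.

```markdown
---
name: refund-processing
description: Use when a customer asks for a refund on an order.
---
1. Look up the order with the CRM tool.
2. If the order is under 30 days old, issue the refund.
3. Otherwise, draft an escalation summary for a human reviewer.
```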
Progressive Disclosure — Three Tiers
When an agent has hundreds of skills, loading all of them into the context window would blow the token budget. Skills use progressive disclosure:
- Tier 1 (metadata): every skill's name and description stay in context at all times, so the agent knows what is available.
- Tier 2 (instructions): the full skill.md body is loaded only when the description matches the task at hand.
- Tier 3 (resources): scripts/, references/, and assets/ are loaded at the point of need.
How skills relate to MCP and RAG: MCP gives agents tool access (what the agent can reach). RAG gives factual knowledge (reference material). Skills give procedural knowledge — how to do things, in what order, with what judgment. Skills often use MCP tools, with the skill providing the judgment for when and how to invoke them. The skill.md format is an open standard (Apache 2.0) at agentskills.io, adopted across Claude Code, OpenAI Codex, and many other platforms.
Trust warning: Skills can include executable scripts with access to file systems and API keys. Always review a skill before installing it — audits have found prompt injection, tool poisoning, and hidden malware in publicly available skills. Treat skill installation like any software dependency.
Agent Architecture Components
LLM Core
The reasoning engine. Chooses actions, interprets results.
Tools
Functions the agent can call: APIs, DB queries, web search, calculators.
Memory
Short-term: conversation context. Long-term: vector DB of past interactions.
Guardrails
Rules constraining what the agent can do. Approval workflows for high-stakes actions.
Orchestrator
Manages the agent loop. Handles retries, timeouts, error handling.
Observability
Logging every step: what the agent thought, what tools it called, what it returned.
Building AI Applications — RAG, CAG & Multimodal RAG
RAG: The Most Common Enterprise AI Pattern
RAG (Retrieval-Augmented Generation) is how you make an LLM answer questions about your data without retraining the model. It's a compound AI system: the model queries an external searchable knowledge base, retrieves relevant documents, and uses them as context for generation.
How RAG Works — Step by Step
1. Ingest documents: Take your internal docs (PDFs, Confluence, Slack, CRM). Split into chunks (200-500 tokens each).
2. Create embeddings: Run each chunk through an embedding model. Converts text to a numerical vector (~1,500 numbers) capturing semantic meaning.
3. Store in vector database: Store vectors in a vector DB (Pinecone, pgvector, Qdrant, Chroma). Enables fast similarity search.
4. At query time: Convert the user's question to a vector using the same embedding model.
5. Search: Find the top 5-20 most similar document chunks in your vector DB.
6. Augment: Insert those chunks into the LLM prompt as context.
7. Generate: The LLM answers based on the retrieved context, dramatically reducing hallucination.
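A minimal end-to-end sketch of those seven steps. The embed() function is a toy stand-in (swap in a real embedding model), llm() is a stub for your chat-model API, and the in-memory list stands in for a vector DB.

```python
# Minimal RAG sketch following the steps above.
import numpy as np

def embed(text: str) -> np.ndarray:
    vec = np.zeros(64)                       # toy bag-of-bytes "embedding";
    for b in text.encode():                  # swap for a real embedding model
        vec[b % 64] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

def llm(prompt: str) -> str:
    raise NotImplementedError                # wire to Claude/GPT-4 etc.

# Steps 1-3: chunk documents, embed each chunk, store vector + text.
chunks = ["Refunds are issued within 5 days.", "Support hours are 9-5 CET."]
index = [(embed(c), c) for c in chunks]

def answer(question: str, k: int = 5) -> str:
    q = embed(question)                                      # step 4: embed query
    ranked = sorted(index, key=lambda p: -float(p[0] @ q))   # step 5: similarity search
    context = "\n\n".join(text for _, text in ranked[:k])
    prompt = (f"Answer using only this context:\n{context}\n\n"  # step 6: augment
              f"Question: {question}")
    return llm(prompt)                                       # step 7: generate
```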
RAG at Scale: Millions of Documents
When people talk about RAG over millions of PDFs, they're describing a search system + an LLM — not just a vector database demo.
1. Ingestion (offline)
Take PDFs from cloud storage. OCR if scanned. Clean up text. Split into chunks of ~512-2000 words with overlap. Attach metadata — document ID, page number, section title, date.
2. Embeddings + Index
Run dedicated embedding jobs (GPUs or large batches). Build a distributed index (Milvus, Qdrant, Vespa, Elasticsearch). Use efficient algorithms (HNSW, IVF, PQ). Metadata filtering happens first — vector search helps rank, not replace, filtering.
3. Retrieval + Generation
Embed the incoming query, apply metadata filters, pull the top candidates from the distributed index, rerank them, and pass only the best chunks to the LLM. Same steps as basic RAG, just engineered for scale.
4. Caching
Cache: query → final answer; query → list of relevant chunk IDs; hot data (frequently used vectors). Real request path: User query → cache → if not found → retrieve + LLM → save to cache.
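A minimal cache-aside sketch of that request path, reusing the answer() function from the RAG sketch above; the plain dict stands in for Redis or another shared cache.

```python
# Cache-aside sketch: check the cache first, fall back to retrieve + LLM.
cache: dict[str, str] = {}

def cached_answer(query: str) -> str:
    if query in cache:          # hit: skip retrieval and the LLM entirely
        return cache[query]
    result = answer(query)      # miss: full retrieve + generate path
    cache[query] = result       # save for next time
    return result
```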
5. Monitoring
Track retrieval quality, answer quality (thumbs up/down), latency, cache hit rate. Re-embed and re-shard when models or data change.
CAG: Cache Augmented Generation — An Alternative to RAG
CAG takes a different approach: instead of retrieving knowledge on demand, you preload the entire knowledge base into the model's context window all at once. The model processes everything in a single forward pass and stores its internal state (the KV cache — key-value cache). Subsequent queries use this cached state without reprocessing all the text.
RAG — Retrieve on Demand
Knowledge base: Can be massive (millions of docs). Only retrieves small pieces at a time.
Latency: Higher — extra retrieval step per query.
Data freshness: Easy — update the index incrementally.
Best for: Large, dynamic knowledge bases; when citations are needed.
CAG — Preload Everything
Knowledge base: Constrained by context window size (32K-100K tokens typical).
Latency: Lower — no retrieval lookup; one forward pass.
Data freshness: Requires recomputation when data changes.
Best for: Small, static knowledge bases; when low latency matters.
RAG or CAG? Use RAG when your knowledge source is very large, frequently updated, or you need citations. Use CAG when you have a fixed set of knowledge that fits within the context window, latency is critical, and you want simpler deployment. For complex scenarios (like clinical decision support), a hybrid approach works: RAG to retrieve the relevant subset, then CAG to create temporary working memory for follow-up questions.
Multimodal RAG — Three Approaches
Real-world data isn't just text. It includes network diagrams, screenshots, scanned PDFs, videos, and audio. Multimodal RAG extends retrieval to handle multiple data modalities.
Approach 1: Text-ify Everything RAG
Convert all modalities to text first. Images → captions via captioning model. Audio/video → transcripts via STT. Then use standard text RAG. Easy but loses visual context and spatial relationships.
Approach 2: Hybrid Multimodal RAG
Retrieval is still text-based (search over captions + transcripts), but the LLM receives the original non-text data (images, audio clips) alongside retrieved text. The multimodal LLM reasons over everything together. Retrieval is only as good as the text captions.
Approach 3: Full Multimodal RAG
Uses a shared vector space — text, images, and audio all get embedded into the same space. A single query vector can directly retrieve text paragraphs, diagrams, and video frames. Most powerful but highest cost and complexity.
Native Multimodality vs Feature-Level Fusion
Feature-level fusion: A separate vision encoder extracts features from images and passes numerical representations to the LLM. The LLM only sees a summarized description, not the raw signal. Cheaper but information can be lost.
Native multimodality: All modalities (text, images, audio, video) are tokenized and embedded into a shared vector space. The model attends to everything simultaneously. For video, this uses spatial-temporal patches — 3D cubes capturing motion across frames, not just flat squares. Native models can also do any-to-any generation: take in any combination of modalities and output any combination.
RAG pitfall: "Garbage in, garbage out" applies fully. Data quality work is 60% of a RAG project. If your documents are messy or out of date, RAG will confidently retrieve bad information.
Key Frameworks & Tools
LangChain
Python/JS framework for LLM apps. Chains prompts, tools, memory, retrievers. Widely used but can be over-abstracted.
LlamaIndex
Focused specifically on RAG. Better for document indexing and retrieval pipelines.
CrewAI / AutoGen
Multi-agent frameworks. Define agents with different roles that collaborate on complex tasks.
Vercel AI SDK
Lightweight SDK for building AI chat UIs in Next.js/React. Handles streaming, tool calls.
Practical: Building an AI Feature (Simplified)
Building "An AI that answers customer questions using your knowledge base":
| Step | What You Do | Tools/Services |
|---|---|---|
| 1. Data prep | Export KB articles, clean HTML, split into chunks | Python, Unstructured.io |
| 2. Embeddings | Generate vector embeddings for each chunk | OpenAI Embeddings, Cohere, Voyage AI |
| 3. Vector store | Store embeddings with metadata | Pinecone, pgvector, Qdrant |
| 4. Retrieval | Build search: query → vector → top-k similar chunks | Vector DB SDK + Cohere Rerank |
| 5. Prompt | System prompt with role + injected context chunks | Prompt templating |
| 6. LLM call | Send assembled prompt to Claude/GPT-4 API | Anthropic API, OpenAI API |
| 7. UI | Chat interface with streaming | React, Vercel AI SDK |
| 8. Guardrails | Input validation, output filtering | Custom rules, Guardrails AI |
| 9. Observability | Log requests, responses, latency, cost | LangSmith, Langfuse, Datadog |
| 10. Evaluation | Measure answer quality, hallucination rate | Human review, RAGAS framework |
Multimodal AI — How Models See, Hear & Understand
What is Multimodal AI?
A modality is a type of data — text, images, audio, video, LIDAR, thermal imaging. A multimodal AI model can ingest and/or generate multiple data modalities. Instead of just tokenizing text strings, it can process a screenshot alongside a text description, or generate a video from a text prompt.
Native Multimodality: The Shared Vector Space
In a natively multimodal model, all modalities are tokenized and embedded into the same high-dimensional space. Text words become vectors. Image patches become vectors. Audio chunks become vectors. Because they all live in the same space, a picture of a cat ends up near the word "cat" — and the model can reason about them together without translating between different systems.
Video & Temporal Reasoning
Native video models use spatial-temporal patches — 3D cubes that capture an area across a short window of time (e.g., 8 video frames). Motion is baked into the token itself, not guessed by comparing separate images. This enables the model to understand actions like "picking up" vs "putting down."
Any-to-Any Generation
Because all modalities share the same vector space, multimodal models can do any-to-any generation: take in any combination of modalities and output any combination. You could ask the model to explain how to tie a tie — it generates text instructions and a short video clip, all coherent because everything lives in the same shared space.
Feature-Level Fusion vs Native Multimodality
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Feature-Level Fusion | Separate vision encoder extracts features; passes numerical array to LLM | Cheaper, easier to swap parts | Information lost in transfer; LLM sees summary, not raw data |
| Native Multimodality | All modalities tokenized into shared vector space; model attends to everything simultaneously | Richer understanding; model knows where to look based on the question | More compute, more complexity |
Why this matters: With feature-level fusion, the vision encoder processes your image before it knows what question you're asking — it might compress away the exact detail you need. With a shared vector space, the model attends to text and images simultaneously, so it knows where to look. Ask about a tiny icon in the corner of a screenshot, and the model can focus attention there.
Voice AI — Calls, IVR, and Conversational Agents
How Voice AI Works End-to-End
ASR/STT: Converts spoken audio to text. Leading: OpenAI Whisper (open source), Google Cloud Speech, Deepgram (low latency), AssemblyAI.
LLM Processing: Same as any text-based AI. Transcribed text is the input.
TTS: Converts LLM's text response to natural speech. Leading: ElevenLabs (most natural), OpenAI TTS, Google Cloud TTS, Play.ht, Cartesia (ultra-low latency).
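Stitched together, one conversational turn looks like this hedged sketch; transcribe(), chat(), and synthesize() are placeholders for your chosen ASR, LLM, and TTS providers.

```python
# Hedged sketch of a single voice-call turn with placeholder providers.
def transcribe(audio: bytes) -> str: ...     # ASR/STT, e.g., Whisper
def chat(text: str) -> str: ...              # LLM call on the transcript
def synthesize(text: str) -> bytes: ...      # TTS, e.g., ElevenLabs

def handle_turn(audio_in: bytes) -> bytes:
    text = transcribe(audio_in)              # speech -> text
    reply = chat(text)                       # decide what to say
    return synthesize(reply)                 # text -> speech
```

In production each stage streams (partial transcripts in, audio out before the LLM finishes), which is how the sub-500ms latency targets discussed below are met.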
Voice AI Use Cases
Welcome / Outbound Calls
AI calls new customers to welcome them, walk through onboarding. Platforms: Bland.ai, Retell AI.
Customer Support IVR
Replace "Press 1 for billing" with natural conversation. AI understands intent, resolves or routes.
Appointment Scheduling
AI calls to confirm/reschedule appointments. Used in healthcare, salons, auto services.
Sales Qualification
AI calls inbound leads, asks qualifying questions, logs to CRM. Example: Air AI, Vapi.
Key Decisions for Voice AI
| Decision | Options | Trade-off |
|---|---|---|
| Latency target | <500ms feels natural, >1s feels robotic | Lower latency = more expensive, requires streaming ASR+TTS |
| Build vs buy | Platforms: Vapi, Retell, Bland.ai. Build: Twilio + ASR + LLM + TTS | Platforms faster to ship. Custom gives full control. |
| Interruption handling | Must detect user speaking mid-response and stop gracefully | Hard to get right. Requires VAD (Voice Activity Detection). |
| Phone integration | Twilio, Vonage, Plivo for SIP/PSTN | Twilio is most mature. Costs per minute apply. |
Enterprise AI — Practical Considerations & Technical Debt
Data Privacy & Security
Zero Data Retention (ZDR): Enterprise API tiers guarantee your data is not used for training and is not retained. Verify this in your contract.
Data residency: Run Claude via AWS Bedrock in your preferred region. Data never leaves your VPC. Same with Google Vertex AI.
PII handling: Redact PII before sending, or use enterprise tiers with DPAs. Check SOC2 Type II, HIPAA BAA compliance.
Cost Management
LLM API Pricing Model
You pay per token — both input (prompt) and output (response). Roughly $3-15 per million input tokens depending on the model, i.e., about $0.01-0.05 for a typical ~3,500-token request.
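A quick worked example of what per-token pricing means in practice; the prices and traffic numbers are assumptions for illustration, not any provider's actual rates.

```python
# Worked cost example under assumed prices ($3 / $15 per million
# input / output tokens -- illustrative, check your provider's sheet).
input_price = 3 / 1_000_000      # $ per input token
output_price = 15 / 1_000_000    # $ per output token

requests_per_day = 10_000
input_tokens = 2_000             # system prompt + context per request
output_tokens = 500              # typical response length

daily = requests_per_day * (input_tokens * input_price
                            + output_tokens * output_price)
print(f"${daily:,.2f}/day")      # -> $135.00/day at these assumptions
```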
Cost Optimization Strategies
Model routing: Use a cheap/fast model (Claude Haiku, GPT-4o mini) for simple queries. Route complex queries to the expensive model. Can cut costs 60-80%.
Caching: Cache frequent queries. Anthropic offers prompt caching — reused system prompts cost a fraction.
Shorter prompts: Every token in your system prompt is charged on every request. Optimize for brevity.
Batch processing: For non-real-time tasks, use batch APIs at 50% discount.
AI Technical Debt — The Elephant in the Room
AI technical debt is trading off speed now for costs later — future cost from present shortcuts. It's the interest you pay because you didn't make a large enough down payment upfront. In AI, debt compounds even faster because AI is probabilistic, context-dependent, and moves extremely fast.
Strategic vs Reckless Technical Debt
Strategic: Taken consciously. You know the risks, they're documented, time-bound, with a remediation plan. A valid way to get to market fast.
Reckless: Poor discipline. No documentation, no remediation plan, no future — just a mess headed your way.
Four Categories of AI Technical Debt
1. Data Debt
Garbage in = amplified garbage out. Unvetted sources, bias, data drift, poisoning, no anonymization. Fix: Vet sources, check for bias, monitor drift, anonymize PII.
2. Model Debt
No version control, no evaluation metrics, no rollback ability, no penetration testing. Fix: Version models, set eval metrics, plan rollbacks, pen-test.
3. Prompt Debt
Undocumented system prompts, no input validation, prompt injection vulnerabilities, data leakage via outputs, no guardrails. Fix: Document prompts, validate inputs, use an AI gateway with input/output filtering.
4. Organizational Debt
Unclear ownership, no governance policy, no red teaming, scalability issues, latency surprises at scale. Fix: Define ownership, establish governance, red-team, plan for scale.
The result of unchecked AI technical debt: An AI you don't trust. "Ready, fire, aim" doesn't work. The project lifecycle hasn't changed just because it's AI: Requirements → Architecture → Implementation → Testing → Deployment → Evaluation → feed back to Requirements. AI technical debt = speed minus discipline, with massive compounding interest.
Team Structure for AI
| Role | What They Do | Typical Background |
|---|---|---|
| AI/ML Engineer | Builds pipelines, integrates LLMs, manages RAG, fine-tuning | Software engineer + ML experience |
| Prompt Engineer | Designs and optimizes system prompts, evaluates output quality | Domain expert + writing skill |
| Data Engineer | Prepares, cleans, and pipelines data for RAG / training | Data engineering, ETL |
| Platform/Infra | Manages GPUs, vector DBs, model serving, observability | DevOps / SRE with ML infra |
Salesforce AI Ecosystem
What Salesforce "Agentforce" Actually Is
Agentforce = LLM (multi-model routing) + Salesforce data (CRM, Data Cloud, Knowledge Base) + Tools (Salesforce actions: create case, update opportunity, send email) + Guardrails (Trust Layer + business rules) + Deployment (Service Cloud, Sales Cloud, web, Slack).
What Questions to Consider
• "How does Agentforce handle multi-step tool failures and retries?"
• "What's the latency overhead of the Trust Layer on each LLM call?"
• "Can I bring my own model (BYOM) and still use the orchestration layer?"
• "How does Data Cloud grounding work — is it RAG under the hood? What embedding model?"
• "What's the pricing model — per agent, per conversation, per action?"
• "How do I evaluate agent quality? Is there built-in testing/evaluation tooling?"
• "What observability do I get — can I see every reasoning step, tool call, and retrieval?"
Risks, Limitations & What Can Go Wrong
Hallucination
LLMs generate plausible-sounding false information. Mitigation: RAG, citations, confidence scoring, human-in-the-loop.
Prompt Injection
Malicious inputs that override system instructions. Mitigation: input sanitization, separate system/user contexts, output validation.
Data Leakage
Model reveals training data or other users' data. Mitigation: data isolation, output filtering, PII redaction.
Bias
Models inherit biases from training data. Mitigation: bias testing, diverse data, human oversight.
Cost Overruns
Token costs spike unexpectedly. One rogue agent loop burns budget. Mitigation: per-request budgets, circuit breakers, cost monitoring.
Vendor Lock-in
Building on one provider's API creates dependency. Mitigation: abstract the model layer, support model switching.
CTO Strategy — Where to Start & LLM vs Agent Decision Framework
The Pragmatic Adoption Ladder
1. Internal productivity (low risk, high value): Deploy an LLM chatbot connected to your internal knowledge base. Quickest win with least risk.
2. Customer-facing copilot (medium risk): AI that helps customers with common questions, grounded in your help center. Always with a "talk to human" escape hatch.
3. Process automation agents (higher risk): Agents that take actions — update records, send emails, process refunds. Requires guardrails, approval workflows.
4. Autonomous agents (highest complexity): Multi-step agents handling end-to-end workflows with minimal human oversight. Only after robust evaluation, monitoring, and rollback.
LLM vs Agent — When Simple is Better
A common mistake is building an elaborate agent with multi-step planning and tool use when a single LLM prompt would do the job faster and cleaner. Sometimes simple is better.
Use a Single LLM When…
• Task is single-step (quick answer, one-off task)
• Low complexity — no need for planning or external tools
• Speed matters — want fast results without overhead
• Examples: writing an email, summarizing a document, translating text, generating ideas, simple code snippets
Use an Agent When…
• Task requires multi-step reasoning and planning
• Need to use tools — APIs, databases, external systems
• Autonomy is required — the system decides what steps to take and in what order
• Examples: automating workflows, researching competitors + compiling + emailing reports, debugging + testing + deploying code
The LLM vs Agent heuristic: If you can describe the task as a single question with a single answer, use an LLM. If the task requires a workflow — pull data, run analysis, create a chart, email it — use an agent. Next time you're building with AI, ask: do I really need an agent, or will a simple LLM do?
The #1 mistake CTOs make: Starting with a model choice instead of starting with the problem. Pick a specific, measurable business problem first. Then figure out which AI approach solves it. Often you don't need the most expensive model — or any model at all.
Quick Reference: The AI Glossary
| Term | Plain English |
|---|---|
| Token | A chunk of text (~¾ of a word). The unit LLMs process and bill by. |
| Embedding | Converting text to a list of numbers that capture meaning. Similar texts → similar numbers. |
| Vector Database | A database optimized for storing and searching embeddings by similarity. |
| RAG | Retrieval-Augmented Generation. Look up relevant docs, then feed them to the LLM. |
| CAG | Cache Augmented Generation. Preload entire knowledge base into context; use KV cache for fast queries. |
| Fine-tuning | Additional training on specific data to specialize a model. Expensive, usually unnecessary. |
| Prompt Engineering | Crafting the instructions (system prompt) to get the best output from an LLM. |
| Temperature | Controls randomness. 0 = deterministic, 1 = creative. Use low for facts, higher for brainstorming. |
| Context Window | Max text the model can process at once. Bigger = more info per request, but more expensive. |
| RLHF | Reinforcement Learning from Human Feedback. How models learn to be helpful vs. just technically correct. |
| Inference | Running a trained model to get a prediction/response. The thing you pay for in production. |
| Hallucination | When the model confidently outputs false information. |
| Guardrails | Rules that constrain AI behavior — what it can/can't do, say, or access. |
| MCP | Model Context Protocol. Open standard for connecting LLMs to external tools and data sources. |
| Function Calling | The LLM outputs structured data to invoke external tools/APIs. Mechanism behind agents. |
| Agentic | AI that can plan, act, observe, and loop — not just respond to a single prompt. |
| SLM | Small Language Model. <10B parameters. Fast, cheap, on-prem specialist. |
| Frontier Model | Most capable models today. Best reasoning, best at complex multi-step tasks. |
| KV Cache | Key-Value Cache. Model's internal state after digesting documents; used in CAG. |
| Progressive Disclosure | Loading skill metadata first, full instructions when relevant, resources at point of need. |
| ReAct | Reasoning + Acting. The think-act-observe loop pattern used by most AI agents. |
| Shared Vector Space | Single embedding space where text, images, and audio all coexist; enables native multimodality. |